| | Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| age_resp | 5 | 0 | 14.0 | 1.4 | 12.0 | 14.0 | 16.0 |
| bdate_y | 5 | 0 | 1982.0 | 1.4 | 1980.0 | 1982.0 | 1984.0 |
| pincome | 2234 | 27 | 46483.5 | 42112.7 | 0.0 | 37900.0 | 246474.0 |
| pnetworth | 2507 | 26 | 90429.7 | 137578.8 | −935251.0 | 34500.0 | 600000.0 |
| retsav1 | 228 | 75 | 44237.2 | 63017.9 | 0.0 | 20000.0 | 300000.0 |
| mom_age_birth | 31 | 7 | 25.5 | 5.4 | 16.0 | 25.0 | 45.0 |
| mom_age | 35 | 7 | 39.5 | 5.5 | 28.0 | 39.0 | 61.0 |
| has_retsav | 3 | 13 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 |
| owns_home | 3 | 18 | 0.6 | 0.5 | 0.0 | 1.0 | 1.0 |
| both_parents | 2 | 0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 |
NLSY Parent Income and Wealth Imputation
Data Summary
| Variable | Level | N | % |
|---|---|---|---|
| mom_educ_hs | Less than high school | 1779 | 20.0 |
| | High-school graduate | 2860 | 32.1 |
| | Some college | 1877 | 21.1 |
| | College degree | 1057 | 11.9 |
| | Graduate degree | 284 | 3.2 |
| dad_educ_hs | Less than high school | 1142 | 12.8 |
| | High-school graduate | 1913 | 21.5 |
| | Some college | 1224 | 13.8 |
| | College degree | 894 | 10.0 |
| | Graduate degree | 338 | 3.8 |
| race_eth | AAPI Hispanic | 4 | 0.0 |
| | AAPI NonHispanic | 156 | 1.8 |
| | AIAN Hispanic | 18 | 0.2 |
| | AIAN NonHispanic | 43 | 0.5 |
| | Black Hispanic | 54 | 0.6 |
| | Black NonHispanic | 2333 | 26.2 |
| | Other Race Hispanic | 944 | 10.6 |
| | Other Race NonHispanic | 119 | 1.3 |
| | White Hispanic | 819 | 9.2 |
| | White NonHispanic | 4406 | 49.5 |
Imputation of Parental Income and Net Worth
Choices
From the mice package documentation:
- First, we should decide whether the missing at random (MAR) assumption (Rubin 1976) is plausible. The MAR assumption is a suitable starting point in many practical cases, but there are also cases where the assumption is suspect. Schafer (1997, pp. 20–23) provides a good set of practical examples. MICE can handle both MAR and missing not at random (MNAR). Multiple imputation under MNAR requires additional modeling assumptions that influence the generated imputations. There are many ways to do this. We refer to Section 6.2 for an example of how that could be realized.
- The second choice refers to the form of the imputation model. The form encompasses both the structural part and the assumed error distribution. Within MICE the form needs to be specified for each incomplete column in the data. The choice will be steered by the scale of the dependent variable (i.e., the variable to be imputed), and preferably incorporates knowledge about the relation between the variables. Section 3.2 describes the possibilities within mice 2.9.
- Our third choice concerns the set of variables to include as predictors into the imputation model. The general advice is to include as many relevant variables as possible including their interactions (Collins et al. 2001). This may however lead to unwieldy model specifications that could easily get out of hand. Section 3.3 describes the facilities within mice 2.9 for selecting the predictor set.
- The fourth choice is whether we should impute variables that are functions of other (incomplete) variables. Many data sets contain transformed variables, sum scores, interaction variables, ratios, and so on. It can be useful to incorporate the transformed variables into the multiple imputation algorithm. Section 3.4 describes how mice 2.9 deals with this situation using passive imputation.
- The fifth choice concerns the order in which variables should be imputed. Several strategies are possible, each with their respective pros and cons. Section 3.6 shows how the visitation scheme of the MICE algorithm within mice 2.9 is under control of the user.
- The sixth choice concerns the setup of the starting imputations and the number of iterations. The convergence of the MICE algorithm can be monitored in many ways. Section 4.3 outlines some techniques in mice 2.9 that assist in this task.
- The seventh choice is m, the number of multiply imputed data sets. Setting m too low may result in large simulation error, especially if the fraction of missing information is high.
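To make the role of m in the seventh choice concrete, the sketch below pools m point estimates with Rubin's rules, where the total variance is T = W + (1 + 1/m) B (W is the average within-imputation variance, B the between-imputation variance). It is an illustration in plain Python rather than the R workflow this report relies on; the function name `pool_rubin` and all numbers are hypothetical and not taken from the analysis.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool m estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)            # pooled point estimate
    w = mean(within_variances)         # average within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b            # total variance; the 1/m term is simulation error
    return {"estimate": q_bar, "within": w, "between": b, "total": t}


# Hypothetical estimates of one coefficient from m = 5 imputed data sets.
estimates = [46000.0, 47200.0, 45500.0, 46900.0, 46400.0]
within = [640000.0] * 5
pooled = pool_rubin(estimates, within)
print(pooled)
```

The B/m term shrinks as m grows, which is why a small m inflates the pooled variance most when the between-imputation variance (and hence the fraction of missing information) is large.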
Significance of Predictors
| | pincome | pnetworth |
|---|---|---|
| (Intercept) | 67843.438*** | −1589.162 |
| mom_age | 765.726*** | 4276.303*** |
| race_ethAAPI NonHispanic | −45621.940** | −57439.825 |
| race_ethAIAN Hispanic | −71898.806* | −135081.226 |
| race_ethAIAN NonHispanic | −50782.647** | −134459.990* |
| race_ethBlack Hispanic | −66498.424*** | −169547.969** |
| race_ethBlack NonHispanic | −54745.115*** | −152405.027** |
| race_ethOther Race Hispanic | −52600.891*** | −118811.326 |
| race_ethOther Race NonHispanic | −56368.568*** | −105934.771 |
| race_ethWhite Hispanic | −55943.174*** | −144242.355** |
| race_ethWhite NonHispanic | −47277.084** | −99882.141 |
| mom_educ_hs.L | 22019.499*** | 71591.805*** |
| mom_educ_hs.Q | 6059.871*** | 23474.098*** |
| mom_educ_hs.C | 1107.549 | −615.634 |
| mom_educ_hs^4 | −2121.900 | −14018.565*** |
| dad_educ_hs.L | 26076.534*** | 60049.340*** |
| dad_educ_hs.Q | 1673.669 | 12633.649* |
| dad_educ_hs.C | 1146.181 | 7439.006 |
| dad_educ_hs^4 | −357.175 | −4198.144 |
| has_retsav | 16538.399*** | 44285.502*** |
| owns_home | 12387.903*** | 68008.452*** |
| both_parents | −1833.451 | 21864.080*** |
| par_dec | −2769.377 | 14312.449 |
| Num.Obs. | 2973 | 2973 |
| R2 | 0.324 | 0.314 |
| R2 Adj. | 0.319 | 0.309 |
| AIC | 70546.8 | 78287.6 |
| BIC | 70690.7 | 78431.5 |
| Log.Lik. | −35249.385 | −39119.783 |
| F | 64.307 | 61.428 |
| RMSE | 34117.48 | 125418.88 |
Checking for multi-collinearity - correlation plot for numeric variables
| | age_resp | mom_age | has_retsav | owns_home | both_parents |
|---|---|---|---|---|---|
| age_resp | 1.000000000 | 0.2115985 | 0.007221412 | 0.02020203 | -0.0336063 |
| mom_age | 0.211598517 | 1.0000000 | 0.155942508 | 0.20824653 | 0.2347601 |
| has_retsav | 0.007221412 | 0.1559425 | 1.000000000 | 0.41849655 | 0.2698076 |
| owns_home | 0.020202029 | 0.2082465 | 0.418496546 | 1.00000000 | 0.3737127 |
| both_parents | -0.033606299 | 0.2347601 | 0.269807569 | 0.37371271 | 1.0000000 |
Checking for multi-collinearity - VIF and Successive addition of regressors
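As a rough illustration of the VIF idea, variance inflation factors can be read off the diagonal of the inverse of the predictors' correlation matrix. The sketch below applies that identity to the five numeric variables from the correlation matrix above, with values rounded to four decimals. It is a plain-Python illustration, not the code behind this report; it ignores the categorical predictors, so its output need not match VIFs computed from the full regression models. The helper `invert` is a hypothetical name.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


names = ["age_resp", "mom_age", "has_retsav", "owns_home", "both_parents"]
# Correlations copied from the matrix above, rounded to four decimals.
corr = [
    [1.0000, 0.2116, 0.0072, 0.0202, -0.0336],
    [0.2116, 1.0000, 0.1559, 0.2082, 0.2348],
    [0.0072, 0.1559, 1.0000, 0.4185, 0.2698],
    [0.0202, 0.2082, 0.4185, 1.0000, 0.3737],
    [-0.0336, 0.2348, 0.2698, 0.3737, 1.0000],
]

inverse = invert(corr)
vif = {name: inverse[i][i] for i, name in enumerate(names)}
for name, value in vif.items():
    print(f"{name}: {value:.2f}")
```

With pairwise correlations no larger than about 0.42, these factors stay close to 1, far below the thresholds of 5 or 10 that are commonly used to flag problematic collinearity.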
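Checking for missing values in the predictor variables
| Variable | Missing (N) |
|---|---|
| id | 0 |
| pincome | 2361 |
| pnetworth | 2327 |
| mom_age | 593 |
| mom_educ_hs | 1039 |
| dad_educ_hs | 3385 |
| race_eth | 0 |
| has_retsav | 1158 |
| owns_home | 1583 |
| both_parents | 0 |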
| | id | pincome | pnetworth | mom_age | mom_educ_hs | dad_educ_hs | race_eth | has_retsav | owns_home | both_parents |
|---|---|---|---|---|---|---|---|---|---|---|
| id | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| pincome | 1 | 0.000 | 0.330 | 0.903 | 0.839 | 0.579 | 1 | 0.543 | 0.557 | 1 |
| pnetworth | 1 | 0.320 | 0.000 | 0.903 | 0.837 | 0.651 | 1 | 0.502 | 0.526 | 1 |
| mom_age | 1 | 0.614 | 0.619 | 0.000 | 0.671 | 0.540 | 1 | 0.750 | 0.722 | 1 |
| mom_educ_hs | 1 | 0.633 | 0.635 | 0.812 | 0.000 | 0.363 | 1 | 0.763 | 0.721 | 1 |
| dad_educ_hs | 1 | 0.706 | 0.760 | 0.919 | 0.804 | 0.000 | 1 | 0.864 | 0.807 | 1 |
| race_eth | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| has_retsav | 1 | 0.068 | 0.000 | 0.872 | 0.788 | 0.603 | 1 | 0.000 | 0.105 | 1 |
| owns_home | 1 | 0.340 | 0.303 | 0.896 | 0.817 | 0.587 | 1 | 0.346 | 0.000 | 1 |
| both_parents | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Checking for MCAR
The two pairs of boxplots on the edges of Figure 1 show the distributions of pincome and pnetworth when the other variable is present and when it is missing. pnetworth has a somewhat lower mean in the subsample where pincome is missing than in the subsample where pincome is present. The distribution of pincome in the subsample where pnetworth is missing is shifted upward relative to its distribution in the subsample where pnetworth is present. In each case the two distributions are relatively similar, so the MCAR assumption is not implausible.
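The comparison behind those boxplots can be sketched numerically: split the observed values of one variable by whether the other variable is missing, then compare the two groups. The snippet below is a minimal plain-Python illustration on made-up records (missing values coded as `None`); the helper `split_by_missingness` and the toy data are hypothetical and do not come from the NLSY sample.

```python
from statistics import mean


def split_by_missingness(values, other):
    """Return observed `values` where `other` is present, and where it is missing."""
    present = [v for v, o in zip(values, other) if v is not None and o is not None]
    missing = [v for v, o in zip(values, other) if v is not None and o is None]
    return present, missing


# Hypothetical toy records; None marks a missing value.
pincome = [30000.0, 52000.0, None, 41000.0, 75000.0, None, 28000.0, 60000.0]
pnetworth = [15000.0, None, 40000.0, 22000.0, None, 9000.0, 12000.0, 95000.0]

nw_inc_present, nw_inc_missing = split_by_missingness(pnetworth, pincome)
print("mean pnetworth, pincome present:", mean(nw_inc_present))
print("mean pnetworth, pincome missing:", mean(nw_inc_missing))
```

Similar group means are consistent with MCAR but do not prove it; a large gap would suggest that missingness depends on the other variable.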
Multiple Imputation of Parental Income and Net Worth
| Variable | Method |
|---|---|
| id | |
| pincome | pmm |
| pnetworth | pmm |
| mom_age | pmm |
| mom_educ_hs | polr |
| dad_educ_hs | polr |
| race_eth | |
| has_retsav | logreg |
| owns_home | logreg |
| both_parents | |
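For readers unfamiliar with pmm (predictive mean matching), the method assigned to the continuous variables above, the sketch below shows the core idea: fit a regression on the observed cases, find the few observed cases whose predicted values are closest to the prediction for a missing case, and copy the observed value of one of those donors at random. This is a deliberately simplified plain-Python illustration with one predictor. It omits the step in which a proper multiple-imputation routine also draws the regression parameters to reflect their uncertainty, so it should not be read as the mice implementation. All names and data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def pmm_impute(x, y, donors=3, rng=None):
    """Fill None entries of y with observed values from the nearest-prediction donors."""
    rng = rng or random.Random(0)
    observed = [(a, b) for a, b in zip(x, y) if b is not None]
    intercept, slope = fit_line([a for a, _ in observed], [b for _, b in observed])
    predicted = [(intercept + slope * a, b) for a, b in observed]
    completed = []
    for a, b in zip(x, y):
        if b is not None:
            completed.append(b)  # observed values are never altered
            continue
        target = intercept + slope * a
        nearest = sorted(predicted, key=lambda pair: abs(pair[0] - target))[:donors]
        completed.append(rng.choice(nearest)[1])  # copy a real observed value
    return completed


# Hypothetical toy data: x is fully observed, y has two gaps.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [10.0, 21.0, None, 39.0, 52.0, None, 68.0, 81.0]
completed = pmm_impute(x, y, donors=3, rng=random.Random(42))
print(completed)
```

Because imputed values are always copied from observed cases, pmm cannot produce values outside the observed range, which is convenient for skewed variables such as income and net worth.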
| | id | pincome | pnetworth | mom_age | mom_educ_hs | dad_educ_hs | race_eth | has_retsav | owns_home | both_parents |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| pincome | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| pnetworth | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| mom_age | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| mom_educ_hs | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| dad_educ_hs | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| race_eth | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| has_retsav | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| owns_home | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| both_parents | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
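A predictor matrix of this kind is read row by row: following the usual mice convention, each row is a variable to be imputed and a 1 in a column means that column is used as a predictor for it. The sketch below builds such a matrix in plain Python under two assumptions that appear to match the table above: the identifier column is never used as a predictor, and no variable predicts itself. The function name `make_predictor_matrix` is hypothetical, and the result is not guaranteed to reproduce every cell of the report's matrix.

```python
def make_predictor_matrix(variables, exclude=()):
    """Return {target: {predictor: 0 or 1}} with a zero diagonal and excluded columns."""
    excluded = set(exclude)
    return {
        target: {
            predictor: 0 if predictor == target or predictor in excluded else 1
            for predictor in variables
        }
        for target in variables
    }


variables = [
    "id", "pincome", "pnetworth", "mom_age", "mom_educ_hs",
    "dad_educ_hs", "race_eth", "has_retsav", "owns_home", "both_parents",
]
# Assumption: the row identifier carries no information and is dropped as a predictor.
pred = make_predictor_matrix(variables, exclude=["id"])
print(pred["pincome"])
```

Rows for fully observed variables (here id, race_eth, and both_parents) have no effect, because nothing is imputed for them.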
Distribution Plots for Parent Income (non-imputed and imputed)
Distribution Plots for Parent Wealth (non-imputed and imputed)
Distribution Plots for Dad Education (non-imputed and imputed)
| race_eth | N | dad_educ_miss |
|---|---|---|
| AAPI Hispanic | 4 | 1 |
| AAPI NonHispanic | 156 | 37 |
| AIAN Hispanic | 18 | 16 |
| AIAN NonHispanic | 43 | 14 |
| Black Hispanic | 54 | 31 |
| Black NonHispanic | 2333 | 1411 |
| Other Race Hispanic | 944 | 398 |
| Other Race NonHispanic | 119 | 43 |
| White Hispanic | 819 | 303 |
| White NonHispanic | 4406 | 1131 |
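The counts in this table are easier to compare as shares. The short sketch below divides dad_educ_miss by N for each group, using the figures from the table above; it is a plain-Python illustration, not the code that produced the report.

```python
# (N, dad_educ_miss) per race_eth group, copied from the table above.
counts = {
    "AAPI Hispanic": (4, 1),
    "AAPI NonHispanic": (156, 37),
    "AIAN Hispanic": (18, 16),
    "AIAN NonHispanic": (43, 14),
    "Black Hispanic": (54, 31),
    "Black NonHispanic": (2333, 1411),
    "Other Race Hispanic": (944, 398),
    "Other Race NonHispanic": (119, 43),
    "White Hispanic": (819, 303),
    "White NonHispanic": (4406, 1131),
}

# Share of respondents in each group with father's education missing.
miss_share = {group: missing / n for group, (n, missing) in counts.items()}
total_missing = sum(missing for _, missing in counts.values())

for group, share in sorted(miss_share.items(), key=lambda item: -item[1]):
    print(f"{group}: {share:.1%}")
print("total missing:", total_missing)
```

The shares differ widely across groups (roughly 60% for Black NonHispanic versus roughly 26% for White NonHispanic), so missingness in dad_educ_hs is related to race_eth. The total of 3385 matches the missing count reported for dad_educ_hs earlier.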
Whisker Plots
Scatter Plots for both parent income and wealth (non-imputed is zero)
Imputation of Retirement Savings
| Variable | Method |
|---|---|
| id | |
| pincome | |
| pnetworth | |
| mom_age | |
| mom_educ_hs | |
| dad_educ_hs | |
| race_eth | |
| has_retsav | |
| owns_home | |
| both_parents | |
| retsav | pmm |
| | id | pincome | pnetworth | mom_age | mom_educ_hs | dad_educ_hs | race_eth | has_retsav | owns_home | both_parents | retsav |
|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| pincome | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| pnetworth | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| mom_age | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| mom_educ_hs | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| dad_educ_hs | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| race_eth | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| has_retsav | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| owns_home | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| both_parents | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |
| retsav | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |